This notebook is a simple TensorFlow training pipeline for the Cassava Leaf Disease Classification competition, where we are given 21,397 labeled images of cassava leaves in 5 classes (4 diseases and a healthy group) and asked to predict labels for unseen leaf images. As with most image classification problems, we can experiment with many different forms of augmentation and explore transfer learning.

Note: I am using Dimitre’s TFRecords, which can be found here. He also has 128x128, 256x256, and 384x384 sized images that I added for experimental purposes. Please give his datasets an upvote (and his work in general, it is excellent).

import numpy as np
import pandas as pd
import seaborn as sns
import albumentations as A
import matplotlib.pyplot as plt
import os, gc, cv2, random, warnings, math, sys, json, pprint, pdb

import tensorflow as tf
from tensorflow.keras import backend as K
import tensorflow_hub as hub

from sklearn.model_selection import train_test_split

warnings.simplefilter('ignore')
print(f"Using TensorFlow v{tf.__version__}")
Using TensorFlow v2.4.0

Tip: Setting a seed helps reproduce results. Setting the debug parameter will run the model for a smaller number of epochs to validate the architecture.
#@title Notebook type { run: "auto", display-mode:"form" }
SEED = 16
DEBUG = False #@param {type:"boolean"}
TRAIN = True #@param {type:"boolean"}

def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

GOOGLE = 'google.colab' in str(get_ipython())
KAGGLE = not GOOGLE

seed_everything(SEED)

print("Running on {}!".format(
   "Google Colab" if GOOGLE else "Kaggle Kernel"
))
Running on Google Colab!

Hyperparameters

#@title {run: "auto", display-mode: "form" }

BASE_MODEL= 'efficientnet_b3' #@param ["'efficientnet_b3'", "'efficientnet_b4'", "'efficientnet_b2'"] {type:"raw", allow-input: true}
BATCH_SIZE = 32 #@param {type:"integer"}
HEIGHT = 300 #@param {type:"number"}
WIDTH = 300 #@param {type:"number"}
CHANNELS = 3 #@param {type:"number"}
IMG_SIZE = (HEIGHT, WIDTH, CHANNELS)
EPOCHS = 8 #@param {type:"number"}
print("Using {} with input size {}".format(BASE_MODEL, IMG_SIZE))
Using efficientnet_b3 with input size (300, 300, 3)

Data

Exploring data

df = pd.read_csv(f'{input_path}train.csv')
df.head()
image_id label
0 1000015157.jpg 0
1 1000201771.jpg 3
2 100042118.jpg 1
3 1000723321.jpg 1
4 1000812911.jpg 3

Check how many images are available in the training dataset and also check that each item in the training set is unique
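The cell behind the two outputs below isn't shown; a minimal reconstruction might look like this (illustrated on a tiny stand-in frame so the snippet is self-contained; in the notebook, `df` is the frame loaded from train.csv):

```python
import pandas as pd

# Stand-in for the frame loaded from train.csv above.
df = pd.DataFrame({"image_id": ["a.jpg", "b.jpg", "c.jpg"],
                   "label": [0, 3, 1]})
print(f"Number of training images: {len(df)}")
print(df['image_id'].is_unique)  # True when every image_id appears exactly once
```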

Number of training images: 21397
True

The distribution of labels is clearly imbalanced, as can be observed in the figure below.
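The plotting cell isn't shown; a minimal sketch (with a toy label series standing in for `df['label']`) could be:

```python
import matplotlib
matplotlib.use("Agg")          # headless backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

labels = pd.Series([3, 3, 3, 1, 0, 3, 4, 1])   # stand-in for df['label']
counts = labels.value_counts().sort_index()
ax = counts.plot.bar()
ax.set_xlabel('label')
ax.set_ylabel('count')
```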


Let's preprocess to add the directory string to the filename and rename the column to filename

df['filename'] = df['image_id'].map(lambda x : f'{input_path}train_images/{x}')
df = df.drop(columns = ['image_id'])
df = df.sample(frac=1).reset_index(drop=True)
df.head()
label filename
0 3 /content/gdrive/MyDrive/kaggle/input/cassava-l...
1 3 /content/gdrive/MyDrive/kaggle/input/cassava-l...
2 3 /content/gdrive/MyDrive/kaggle/input/cassava-l...
3 1 /content/gdrive/MyDrive/kaggle/input/cassava-l...
4 3 /content/gdrive/MyDrive/kaggle/input/cassava-l...

Let's find out what labels we have for the 5 categories.
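The mapping below can be loaded from the competition's label_num_to_disease_map.json. A self-contained sketch (writing a temp copy so the snippet runs anywhere; in the notebook you would read `f'{input_path}label_num_to_disease_map.json'` directly):

```python
import json, os, tempfile

# The competition ships this mapping as label_num_to_disease_map.json;
# a temp copy is written here so the snippet is self-contained.
mapping = {'0': 'Cassava Bacterial Blight (CBB)',
           '1': 'Cassava Brown Streak Disease (CBSD)',
           '2': 'Cassava Green Mottle (CGM)',
           '3': 'Cassava Mosaic Disease (CMD)',
           '4': 'Healthy'}
path = os.path.join(tempfile.gettempdir(), 'label_num_to_disease_map.json')
with open(path, 'w') as f:
    json.dump(mapping, f)

with open(path) as f:
    id2label = json.load(f)
```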

{'0': 'Cassava Bacterial Blight (CBB)',
 '1': 'Cassava Brown Streak Disease (CBSD)',
 '2': 'Cassava Green Mottle (CGM)',
 '3': 'Cassava Mosaic Disease (CMD)',
 '4': 'Healthy'}

From the bar chart shown earlier, label 3, Cassava Mosaic Disease (CMD), is the most common one. This imbalance may have to be addressed with a weighted loss function or oversampling. I might try this in a future iteration of this kernel or in a new kernel.
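For reference, one common option (not used in this notebook) is inverse-frequency class weights passed to `model.fit(..., class_weight=class_weight)`; a toy sketch, with a small series standing in for `df['label']`:

```python
import pandas as pd

# Inverse-frequency class weights: rarer classes get larger weights.
labels = pd.Series([3, 3, 3, 3, 1, 0, 4, 1])   # stand-in for df['label']
counts = labels.value_counts().sort_index()
class_weight = {int(k): len(labels) / (len(counts) * v)
                for k, v in counts.items()}
```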

Let's check an example image to see what it looks like
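The cell isn't shown; with PIL it might look like this (a blank stand-in image here; in the notebook you would use `Image.open(df.filename[0])`):

```python
from PIL import Image

# Blank stand-in; in the notebook: img = Image.open(df.filename[0])
img = Image.new('RGB', (800, 600))
print(f"The size of the image is W{img.size[0]} x H{img.size[1]}")
```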

The size of the image is W800 x H600

Loading data

After this quick and rough EDA, let's load each PIL Image into a NumPy array so we can move on to data augmentation.

In fastai, item_tfms and batch_tfms are defined for the data loader API. The item transforms perform a fairly large crop to 224, and other standard augmentations (in aug_transforms) are applied at the batch level on the GPU. The batch size is set to 32 here.

Split the dataset into training set and validation set

train_df, valid_df = train_test_split(
    df
    ,test_size = 0.2
    ,random_state = SEED
    ,shuffle = True
    ,stratify = df['label'])
train_ds = tf.data.Dataset.from_tensor_slices(
    (train_df.filename.values,train_df.label.values))
valid_ds = tf.data.Dataset.from_tensor_slices(
    (valid_df.filename.values, valid_df.label.values))
adapt_ds = tf.data.Dataset.from_tensor_slices(
    train_df.filename.values)
for x,y in valid_ds.take(3): print(x, y)
tf.Tensor(b'/content/gdrive/MyDrive/kaggle/input/cassava-leaf-disease-classification/train_images/3227289141.jpg', shape=(), dtype=string) tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'/content/gdrive/MyDrive/kaggle/input/cassava-leaf-disease-classification/train_images/1494523424.jpg', shape=(), dtype=string) tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'/content/gdrive/MyDrive/kaggle/input/cassava-leaf-disease-classification/train_images/3290333742.jpg', shape=(), dtype=string) tf.Tensor(2, shape=(), dtype=int64)

Data transformation

In this stage we will collate the images and labels, then do some basic transformations so the image size fits the model's input size.

Basically, the item transformations mainly make sure the inputs are all the same size so that they can be collated into batches.

Important: You may have noticed that I have not used any kind of normalization or rescaling. I recently discovered that there is a Normalization layer included in Keras’ pretrained EfficientNet, as mentioned here.
def decode_image(filename):
    img = tf.io.read_file(filename)
    img = tf.image.decode_jpeg(img, channels=3)
    return img
  
def collate_train(filename, label):
    img = decode_image(filename)
    img = tf.image.random_brightness(img, 0.3)
    img = tf.image.random_flip_left_right(img, seed=None)
    img = tf.image.random_crop(img, IMG_SIZE)
    return img, label

def process_adapt(filename):
    img = decode_image(filename)
    img = tf.keras.layers.experimental.preprocessing.Rescaling(1.0 / 255)(img)
    return img

def collate_valid(filename, label):
    img = decode_image(filename)
    img = tf.image.random_crop(img, IMG_SIZE)
    return img, label
AUTOTUNE = tf.data.experimental.AUTOTUNE  # let tf.data tune parallelism
train_ds = train_ds.map(collate_train, num_parallel_calls=AUTOTUNE)
valid_ds = valid_ds.map(collate_valid, num_parallel_calls=AUTOTUNE)
adapt_ds = adapt_ds.map(process_adapt, num_parallel_calls=AUTOTUNE)
train_ds_batch = (train_ds
                  .cache('dump.tfcache')
                  .shuffle(buffer_size=1000)
                  .batch(BATCH_SIZE)
                  .prefetch(buffer_size=AUTOTUNE))

valid_ds_batch = (valid_ds
                  #.shuffle(buffer_size=1000)
                  .batch(BATCH_SIZE*2)
                  .prefetch(buffer_size=AUTOTUNE))

adapt_ds_batch = (adapt_ds
                  .shuffle(buffer_size=1000)
                  .batch(BATCH_SIZE)
                  .prefetch(buffer_size=AUTOTUNE))
def show_images(ds):
    _,axs = plt.subplots(3,3,figsize=(16,16))
    for ((x, y), ax) in zip(ds.take(9), axs.flatten()):
        ax.imshow(x.numpy().astype(np.uint8))
        ax.set_title(int(y))  # labels are sparse ints; np.argmax on a scalar would always give 0
        ax.axis('off')

Show some training images

Show some validation images

Model

Batch augmentation

data_augmentation = tf.keras.Sequential(
    [
     tf.keras.layers.experimental.preprocessing.RandomCrop(HEIGHT, WIDTH),
     tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
     tf.keras.layers.experimental.preprocessing.RandomRotation(0.25),
     tf.keras.layers.experimental.preprocessing.RandomZoom((-0.2, 0)),
     tf.keras.layers.experimental.preprocessing.RandomContrast((0.2,0.2))
    ]
)
func = lambda x,y: (data_augmentation(x), y)
x = (train_ds
     .batch(BATCH_SIZE)
     .take(1)
     .map(func, num_parallel_calls=AUTOTUNE))
show_images(x.unbatch())

Building a model

I am using an EfficientNetB3, on top of which I add some output layers to predict our 5 disease classes. I decided to load the ImageNet pretrained weights locally to keep the internet off (part of the requirements for submitting a kernel to this competition).

%%run_if {GOOGLE}
from tensorflow.keras.applications import EfficientNetB3
from tensorflow.keras.applications import VGG16
def build_model(base_model, num_class):
    inputs = tf.keras.layers.Input(shape=IMG_SIZE)
    x = data_augmentation(inputs)
    x = base_model(x)
    x = tf.keras.layers.Dropout(0.4)(x)
    outputs = tf.keras.layers.Dense(num_class, activation="softmax", name="pred")(x)
    model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
    return model
efficientnet = EfficientNetB3(
    weights = 'imagenet' if TRAIN else None, 
    include_top = False, 
    input_shape = IMG_SIZE, 
    pooling='avg')
efficientnet.trainable = True
model = build_model(base_model=efficientnet, num_class=len(id2label))
Downloading data from https://storage.googleapis.com/keras-applications/efficientnetb3_notop.h5
43941888/43941136 [==============================] - 0s 0us/step
model.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, 300, 300, 3)]     0         
_________________________________________________________________
sequential (Sequential)      (None, 300, 300, 3)       0         
_________________________________________________________________
efficientnetb3 (Functional)  (None, 1536)              10783535  
_________________________________________________________________
dropout (Dropout)            (None, 1536)              0         
_________________________________________________________________
pred (Dense)                 (None, 5)                 7685      
=================================================================
Total params: 10,791,220
Trainable params: 10,703,917
Non-trainable params: 87,303
_________________________________________________________________

Fine tune

The 3rd layer of the EfficientNet is the Normalization layer, which can be adapted to our new dataset instead of ImageNet statistics. Be patient with this one; it does take a bit of time as we go through the entire training set.

%%run_if {GOOGLE and TRAIN}
if not os.path.exists("000_normalization.h5"):
    model.get_layer('efficientnetb3').get_layer('normalization').adapt(adapt_ds_batch)
    model.save_weights("000_normalization.h5")
else:
    model.load_weights("000_normalization.h5")

Optimizer: CosineDecayRestarts

Important: I always wanted to try the new CosineDecayRestarts function implemented in tf.keras, as it seemed promising and I had struggled to find the right settings (if there were any) for ReduceLROnPlateau.
%%run_if {TRAIN}
#@title { run: "auto", display-mode: "form" }
STEPS = math.ceil(len(train_df) / BATCH_SIZE) * EPOCHS
LR_START = 9e-3 #@param {type: "number"}
LR_START *= strategy.num_replicas_in_sync
LR_MIN = 3e-4 #@param {type: "number"}
N_RESTARTS =  5#@param {type: "number"}
T_MUL = 2.0 #@param {type: "number"}
M_MUL =  1#@param {type: "number"}
STEPS_START = math.ceil((T_MUL-1)/(T_MUL**(N_RESTARTS+1)-1) * STEPS)

schedule = tf.keras.experimental.CosineDecayRestarts(
    first_decay_steps=STEPS_START,
    initial_learning_rate=LR_START,
    alpha=LR_MIN,  # note: tf.keras treats alpha as a *fraction* of initial_learning_rate, not an absolute LR
    m_mul=M_MUL,
    t_mul=T_MUL)

x = [i for i in range(STEPS)]
y = [schedule(s) for s in range(STEPS)]

_,ax = plt.subplots(1,1,figsize=(8,5),facecolor='#F0F0F0')
ax.plot(x, y)
ax.set_facecolor('#F8F8F8')
ax.set_xlabel('iteration')
ax.set_ylabel('learning rate')

print('{:d} total epochs and {:d} steps per epoch'
        .format(EPOCHS, STEPS // EPOCHS))
print(schedule.get_config())
8 total epochs and 535 steps per epoch
{'initial_learning_rate': 0.009, 'first_decay_steps': 68, 't_mul': 2.0, 'm_mul': 1, 'alpha': 0.0003, 'name': None}
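For reference, the STEPS_START formula comes from requiring the restart cycle lengths, a geometric series in T_MUL, to sum to the total step budget. Writing $S_0$ for STEPS_START, $t$ for T_MUL and $N$ for N_RESTARTS:

$$
S_0 \sum_{i=0}^{N} t^{i} \;=\; S_0\,\frac{t^{N+1}-1}{t-1} \;=\; \text{STEPS}
\quad\Longrightarrow\quad
S_0 \;=\; \frac{t-1}{t^{N+1}-1}\,\text{STEPS}
$$

With STEPS = 535 × 8 = 4280, $t = 2$ and $N = 5$, this gives 4280/63 ≈ 68, matching the first_decay_steps in the printed config.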

Warning: There is a gap between what I had expected and the actual LearningRateScheduler that TensorFlow gives us. The LearningRateScheduler updates the lr on_epoch_begin, while it makes more sense to do it on_batch_end or on_batch_begin. Passing the schedule object directly to the optimizer sidesteps this, since the optimizer evaluates the schedule at every step.
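To see the size of that gap, here is a pure-Python sketch of one cosine-decay cycle (mirroring tf.keras semantics, where alpha is a fraction of the initial rate; the step counts and rates are the illustrative numbers from above): the optimizer evaluates the schedule every step, while LearningRateScheduler would hold the epoch-start value for the whole epoch.

```python
import math

# Pure-Python sketch of one cosine-decay cycle, mirroring tf.keras
# semantics (alpha is a *fraction* of lr_start, not an absolute LR).
def cosine_decay(step, lr_start=9e-3, alpha=3e-4, decay_steps=535):
    t = min(step, decay_steps) / decay_steps
    cosine = 0.5 * (1 + math.cos(math.pi * t))
    return lr_start * ((1 - alpha) * cosine + alpha)

# Schedule object handed to the optimizer: evaluated at every step.
per_step = [cosine_decay(s) for s in range(535)]
# LearningRateScheduler: evaluated once on_epoch_begin, so the LR
# would sit at its epoch-start value for all 535 steps of the epoch.
per_epoch = [cosine_decay(0)] * 535
```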

Callbacks

LR finder

%%run_if {GOOGLE and TRAIN}
from tensorflow.keras.callbacks import Callback
class LRFinder(Callback):
    """`Callback` that exponentially adjusts the learning rate after
    each training batch between `start_lr` and `end_lr` for a maximum number
    of batches: `max_step`. The loss and learning rate are recorded at each
    step allowing visually finding a good learning rate as
    https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html suggested.
    """

    def __init__(self, start_lr: float = 1e-7, end_lr: float = 10,
                 max_steps: int = 100, smoothing=0.9):
        super(LRFinder, self).__init__()
        self.start_lr, self.end_lr = start_lr, end_lr
        self.max_steps = max_steps
        self.smoothing = smoothing
        self.step, self.best_loss, self.avg_loss, self.lr = 0, 0, 0, 0
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        self.step, self.best_loss, self.avg_loss, self.lr = 0, 0, 0, 0
        self.lrs, self.losses = [], []

    def on_train_batch_begin(self, batch, logs=None):
        self.lr = self.exp_annealing(self.step)
        tf.keras.backend.set_value(self.model.optimizer.lr, self.lr)

    def on_train_batch_end(self, batch, logs=None):
        logs = logs or {}
        loss = logs.get('loss')
        step = self.step
        if loss:
            self.avg_loss = self.smoothing * self.avg_loss + (1 - self.smoothing) * loss
            smooth_loss = self.avg_loss / (1 - self.smoothing ** (self.step + 1))
            self.losses.append(smooth_loss)
            self.lrs.append(self.lr)

            if step == 0 or loss < self.best_loss:
                self.best_loss = loss

            if smooth_loss > 4 * self.best_loss or tf.math.is_nan(smooth_loss):
                self.model.stop_training = True

        if step == self.max_steps:
            self.model.stop_training = True

        self.step += 1

    def exp_annealing(self, step):
        return self.start_lr * (self.end_lr / self.start_lr) ** (step * 1. / self.max_steps)

    def plot(self, skip_end=None):
        lrs = self.lrs[:-skip_end] if skip_end else self.lrs[:-5]
        losses = self.losses[:-skip_end] if skip_end else self.losses[:-5]
        fig, ax = plt.subplots(1, 1, facecolor="#F0F0F0")
        ax.set_ylabel('Loss')
        ax.set_xlabel('Learning Rate')
        ax.set_xscale('log')
        ax.xaxis.set_major_formatter(plt.FormatStrFormatter('%.0e'))
        ax.plot(lrs, losses)
%%run_if {GOOGLE and TRAIN}
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
lr_finder = LRFinder()
_ = model.fit(train_ds_batch, epochs=1, callbacks=[lr_finder])
  6/535 [..............................] - ETA: 9:38 - loss: 1.7046 - accuracy: 0.2086WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.4558s vs `on_train_batch_end` time: 0.6337s). Check your callbacks.
535/535 [==============================] - 122s 200ms/step - loss: 14.9670 - accuracy: 0.3355
%%run_if {GOOGLE and TRAIN}
lr_finder.plot(skip_end=20)

As can be observed from the curve, we can pinpoint lr_max at 9e-3 and lr_min at 3e-4. Let's feed these hyperparameters back into the optimizer schedule and retrain the model.

Before retraining, don't forget to reset the model so that training starts from 000_normalization.h5 rather than from the state left one epoch later by running the lr_finder

Tip: I will create a repo, tflearner and have this implemented as a .reset method of a learner class.
%%run_if {GOOGLE and TRAIN}
efficientnet = EfficientNetB3(
    weights = 'imagenet', 
    include_top = False, 
    input_shape = IMG_SIZE, 
    pooling='avg')
efficientnet.trainable = True
model = build_model(base_model=efficientnet, num_class=len(id2label))
model.load_weights("000_normalization.h5")

Others

%%run_if {TRAIN}
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath='001_best_model.h5',
        monitor='val_loss',
        save_best_only=True),
    ]
%%run_if {TRAIN}
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(schedule),
              metrics=["accuracy"])

Training

%%run_if {TRAIN}
history = model.fit(train_ds_batch,
                    epochs = EPOCHS,
                    validation_data=valid_ds_batch,
                    callbacks=callbacks)
Epoch 1/8
535/535 [==============================] - 1104s 2s/step - loss: 1.3144 - accuracy: 0.5887 - val_loss: 2.8411 - val_accuracy: 0.6185
Epoch 2/8
535/535 [==============================] - 642s 1s/step - loss: 0.9649 - accuracy: 0.6520 - val_loss: 2.7552 - val_accuracy: 0.6407
Epoch 3/8
535/535 [==============================] - 641s 1s/step - loss: 0.9434 - accuracy: 0.6511 - val_loss: 0.8542 - val_accuracy: 0.6979
Epoch 4/8
535/535 [==============================] - 642s 1s/step - loss: 0.7773 - accuracy: 0.7152 - val_loss: 1.6173 - val_accuracy: 0.6336
Epoch 5/8
535/535 [==============================] - 643s 1s/step - loss: 0.8693 - accuracy: 0.6780 - val_loss: 2.3116 - val_accuracy: 0.6271
Epoch 6/8
535/535 [==============================] - 644s 1s/step - loss: 0.7839 - accuracy: 0.7174 - val_loss: 0.8465 - val_accuracy: 0.6956
Epoch 7/8
136/535 [======>.......................] - ETA: 7:39 - loss: 0.6950 - accuracy: 0.7478

Evaluating

def show_history(history):
    topics = ['loss', 'accuracy']
    groups = [{k:v for (k,v) in history.items() if topic in k} for topic in topics]
    _,axs = plt.subplots(1,2,figsize=(15,6),facecolor='#F0F0F0')
    for topic,group,ax in zip(topics,groups,axs.flatten()):
        for (_,v) in group.items(): ax.plot(v)
        ax.set_facecolor('#F8F8F8')
        ax.set_title(f'{topic} over epochs')
        ax.set_xlabel('epoch')
        ax.set_ylabel(topic)
        ax.legend(['train', 'valid'], loc='best')
%%run_if {TRAIN}
show_history(history.history)

We load the best weights that were kept from the training phase. Just to check how our model is performing, we will run predictions over the validation set. This can help highlight any classes that are consistently miscategorised.

model.load_weights('{}001_best_model.h5'.format(
    '' if TRAIN else '../input/cassava-leaf-disease-classification-models/'))

Prediction

x = train_df.sample(1).filename.values[0]
img = decode_image(x)
%%time
imgs = [tf.image.random_crop(img, size=IMG_SIZE) for _ in range(4)]

_,axs = plt.subplots(1,4,figsize=(16,4))
for (x, ax) in zip(imgs, axs.flatten()):
    ax.imshow(x.numpy().astype(np.uint8))
    ax.axis('off')
CPU times: user 61.7 ms, sys: 20 µs, total: 61.7 ms
Wall time: 60.5 ms

I apply some very basic test-time augmentation to every crop extracted from the original 800x600 images. We know we can do some fancy augmentation with albumentations, but I wanted to do it exclusively with Keras preprocessing layers to keep the pipeline as clean as possible.

tta = tf.keras.Sequential(
    [
        tf.keras.layers.experimental.preprocessing.RandomCrop(HEIGHT, WIDTH),
        tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
        tf.keras.layers.experimental.preprocessing.RandomZoom((-0.2, 0.2)),
        tf.keras.layers.experimental.preprocessing.RandomContrast((0.2,0.2))
    ]
)
def predict_tta(filename, num_tta=4):
    img = decode_image(filename)
    img = tf.expand_dims(img, 0)
    imgs = tf.concat([tta(img) for _ in range(num_tta)], 0)
    preds = model.predict(imgs)
    return preds.sum(0).argmax()
pred = predict_tta(df.sample(1).filename.values[0])
print(pred)
from tqdm import tqdm
preds = []
with tqdm(total=len(valid_df)) as pbar:
    for filename in valid_df.filename:
        pbar.update()
        preds.append(predict_tta(filename, num_tta=4))
cm = tf.math.confusion_matrix(valid_df.label.values, np.array(preds))
plt.figure(figsize=(10, 8))
sns.heatmap(cm,
            xticklabels=id2label.values(),
            yticklabels=id2label.values(), 
            annot=True,
            fmt='g',
            cmap="Blues")
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
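From the confusion matrix, per-class recall makes the consistently miscategorised classes explicit; a sketch with a toy 5x5 matrix (in the notebook you would use `cm.numpy()` from the cell above):

```python
import numpy as np

# Per-class recall: diagonal (correct) over row sums (true count per class).
cm = np.array([[ 50,   5,  3,   2,  1],
               [  4,  80,  6,   5,  2],
               [  3,   6, 70,   8,  4],
               [  2,   4,  9, 400, 10],
               [  1,   3,  5,  12, 90]])
recall = cm.diagonal() / cm.sum(axis=1)
```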
test_folder = input_path + 'test_images/'  # input_path already ends with '/'
submission_df = pd.DataFrame(columns=["image_id", "label"])  # a list keeps column order deterministic (a set does not)
submission_df["image_id"] = os.listdir(test_folder)
submission_df["label"] = 0
submission_df['label'] = (submission_df['image_id']
                            .map(lambda x : predict_tta(test_folder+x)))
submission_df
submission_df.to_csv("submission.csv", index=False)

1% Better Everyday
